% Nicholas Arnold
% Steven Braeger
% CDA 5106
%
% Project Final Report
%
% We used the template from IEEE website, linked from
% http://www.ieee.org/conferences_events/conferences/publishing/templates.html	

\documentclass[conference]{IEEEtran}

\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{array}
\usepackage{listings}
\usepackage{enumerate}

% *** CITATION PACKAGES ***
%
%\usepackage{cite}
% cite.sty was written by Donald Arseneau
% V1.6 and later of IEEEtran pre-defines the format of the cite.sty package
% \cite{} output to follow that of IEEE. Loading the cite package will
% result in citation numbers being automatically sorted and properly
% "compressed/ranged". e.g., [1], [9], [2], [7], [5], [6] without using
% cite.sty will become [1], [2], [5]--[7], [9] using cite.sty. cite.sty's
% \cite will automatically add leading space, if needed. Use cite.sty's
% noadjust option (cite.sty V3.8 and later) if you want to turn this off.
% cite.sty is already installed on most LaTeX systems. Be sure and use
% version 4.0 (2003-05-27) and later if using hyperref.sty. cite.sty does
% not currently provide for hyperlinked citations.
% The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/cite/
% The documentation is contained in the cite.sty file itself.

% *** MATH PACKAGES ***
%
%\usepackage[cmex10]{amsmath}
% *** SPECIALIZED LIST PACKAGES ***
%
%\usepackage{algorithmic}
% algorithmic.sty was written by Peter Williams and Rogerio Brito.
% This package provides an algorithmic environment fo describing algorithms.
% You can use the algorithmic environment in-text or within a figure


% *** ALIGNMENT PACKAGES ***
%
%\usepackage{array}
%\usepackage{mdwmath}
%\usepackage{mdwtab}
%\usepackage{eqparbox}

% *** SUBFIGURE PACKAGES ***
%\usepackage[tight,footnotesize]{subfigure}

%\usepackage[caption=false]{caption}
%\usepackage[font=footnotesize]{subfig}
%\usepackage{fixltx2e}
%\usepackage{stfloats}
% stfloats.sty was written by Sigitas Tolusis. This package gives LaTeX2e
% the ability to do double column floats at the bottom of the page as well
% as the top. (e.g., "\begin{figure*}[!b]" is not normally possible in
% LaTeX2e). It also provides a command:
%\fnbelowfloat
% to enable the placement of footnotes below bottom floats (the standard
% LaTeX2e kernel puts them above bottom floats). This is an invasive package

% Other Packages
\usepackage{moreverb}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{url}

% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}


\begin{document}
%
% paper title
% can use linebreaks \\ within to get better formatting as desired
\title{Branch Prediction Sharing in SIMT Architectures: Exploiting Peer Information on Future Events}


% author names and affiliations
% use a multiple column layout for up to three different
% affiliations
\author{\IEEEauthorblockN{Steven Braeger}
\IEEEauthorblockA{Department of Electrical Engineering and\\Computer Science\\
University of Central Florida\\
Orlando, Florida 32826\\
Email: \texttt{steve@soapforge.com}}
\and\IEEEauthorblockN{Nicholas Arnold}
\IEEEauthorblockA{Department of Electrical Engineering and\\Computer Science\\
University of Central Florida\\
Orlando, Florida 32826\\
Email: \texttt{narnold@knights.ucf.edu}}
}

\maketitle

\begin{abstract} 

Branch Prediction has long been used as a method of reducing the latency incurred by a control hazard.  However, traditionally, SIMT architectures 
do not implement branch prediction hardware in order to reduce the complexity of each of the GPU cores.  Although the SIMT model provides some 
opportunities for control hazard penalty reduction in the form of context switching, there are several cases where the penalty is incurred regardless.
We experimented with the application of a branch predictor to a SIMT architecture in order to reduce those hazards.  Furthermore, our model is 
capable of sharing information about symmetric runs from multiple concurrent contexts.  By using a shared Smith\_2 branch predictor \cite{smith} with a large branch history table,
we can demonstrate significantly increased prediction accuracy over a traditional single-context scheme.

\end{abstract}

\IEEEpeerreviewmaketitle


\section{Introduction}
There is a great deal of interest in the evolution of Single-Instruction Multiple-Thread (SIMT) architectures in recent years.  Due to
their tremendous performance and parallelism, as well as an increased generality of these architectures for scientific applications, these 
types of architectures have moved beyond their traditional application in graphics programming to many other application domains, and are becoming more and more useful.
In addition, another factor driving their development is their rapid architectural evolution, which is facilitated by the fact that they do not have a standard instruction set
to implement.  

The SIMT model, which we describe in more detail in \ref{simt-description}, is not without its flaws.  In order to achieve greater parallelism and 
FLOPS, SIMT processors often remove a lot of the superscalar architectural features such as caching, predication, branch prediction,out-of-order execution, and sometimes even pipelining.
These features are replaced with multiple concurrent contexts of execution, known as 'threads'.

However, we believe that benefit can be gained from re-integrating one of those superscalar features: specifically, branch prediction, to gain 
performance by reducing the latency from control hazards that occur when the incorrect instruction is fetched during a jump.  

In addition, we make the observation that branch prediction on a SIMT architecture can actually perform better then branch prediction on a MIMD architecure like a modern multicore CPU.
The intuition for this observation, described in \ref{intuition}, derives from the fact that the SIMT model guarantees that each context of execution
is not working alone, but in teams, called context groups or 'warps', and each of these context groups is also working on a single core.  Whereas traditional branch prediction
is limited to predicting based on knowledge gained from the past, SIMT branch prediction schemes are able to gain information about the present and near-future, by sharing information
about execution with its peer contexts and context-groups.

We demonstrate an implementation of this scheme using a Smith\_2 predictor \cite{smith} described in \ref{predictor_table}.  Our implementation runs in simulation \cite{ispass09},
and demonstrates the impact of sharing no information (traditional MIMD model), sharing information per context group, and sharing information per core.

Our experiment demonstrates that utilizing peer information in the SIMT model provides a significantly improved prediction accuracy rate over the MIMD model
for a subset of the ISPASS09 \cite{ispass09} benchmarking set.

\section{Related Work}

\subsection{Branch Prediction}

Yeh and Patt \cite{Yeh91} discussed implementing an adaptive branch predictor that would use information being collected at runtime, as opposed to other schemes which would require a pre-run of the program for training.  Similar to our approach, their setup requires a Branch History Table and a Branch History Pattern Table, which is updated based on the outcomes of the branches during run-time.

Eden and Mudge \cite{yags98} developed YAGS, or Yet Another Global Scheme, which is a combination of several strong points of previously implemented schemes.  Using information from implementations such as GShare, the 
Agree Predictor, the Bi-Mode Predictor, the Skewed Branch Predictor, and the Filter Mechanism, the authors 
use some schemes to implement a predictor to store the biases of branches and other schemes to implement a predictor to store the 
direction of those branches.  The two varieties of predictor catch not only the regular behavior of branches, but also the special cases instead of just allowing them to be mispredictions.

Pan, So, and Rahmeh \cite{Pan92}, also attempted improved branch prediction, this time by referencing a subhistory of branch in addition to the branch history table itself.  They claim that by adding only a shift register to the architecture, they were able to achieve an additional 11\% prediction accuracy.

%\subsection{GPU Architectures}

%We next wanted to look at a variety of GPU architecture literature to see what is currently being done on the GPU with regards to branch prediction.  We found several SIGGRAPH course presentations which gave us a large amount of information about GPU architectures \cite{Luebke08, Sig07, Sig09}.  In the NVIDIA Research presentation, we learned that the GPU architecture has become less and less hardward bound as more and more parts of the GPU pipeline are allowing user programmability.  This presentation also gave us some insight as to how the GPU assigns threads to cores in groups called 'warps'.

%The SIGGRAPH 2009 course showed in more detail the hardware differences between CPUs and GPUs.  It also showed us how the GPU handles the large amount of elements assigned to each core and how they currently avoid stalls due to branches.  The fact is, they don't handle stalls at all - when they encounter a branch, they just switch do a different context group and execute the same set of instructions they just completed on the previous group.  Once the GPU has finished doing this chuck of each context group, the branch from the first context group has most of the time been resolved and is ready to continue.

%The SIGGRAPH 2007 talk \cite{SIG07} talks more about the progression of the use of GPUs becoming more and more commonplace due to the additional programmability.  It also had a discussion about generalizing the GPU pipeline to allow for better understanding and interactions between varying GPU architectures.  This is important to us because if different GPU architects are not working under some sort of uniformity, then our work here might have vastly different outcomes down the road based on the hardware selection.

\subsection{Branch Prediction on GPU}

He and Zhang \cite{Zhang09} also attempt branch prediction on the GPU.  Their focus, however, was to to test out parallel branch-prediction on GPUs for future use in general purpose many-core CPUs.  Their goals were to get independent branch predictors operating on different cores to cooperate with each other in an effort to:

\begin{enumerate}
	\item increase prediction accuracy,
	\item dynamically combine predictors to make them more powerful, or
	\item switch them off if they are behaving the same in order to save power.
\end{enumerate}

However, the authors use the GPU to test CPU-style computations, not exploiting the parallel neighbor efforts of the SIMT model.  Their efforts apply to CPU-style runtime paradigms but tested on a GPU, compared to our approach that truely exploits the SIMT nature of GPU-style computations.

\section{SIMT vs MIMD }
\label{simt-description}
In general, CPU architecture is different than GPU architecture in a few ways.  CPU architecture includes logic for the fetch and decode phases, an ALU for execution, the execution context (registers, etc.), control logic, branch predictor, memory pre-fetcher, and the data cache.  With this setup, a general CPU is designed to perform Single Instruction, Single Data (SISD) operations well, but performs poorly on Single Instruction, Multiple Data (SIMD), Multiple Instruction, Single Data (MISD), and Multiple Instruction, Multiple Data (MIMD).  GPU architecture removes all of the logic designed to make single threaded processes run more efficiently.  There is no control logic, no branch predictor, no memory pre-fetcher, and no data cache \cite{SIG09}.  With this architecture, a general GPU is designed to perform SIMD operations well, but performs poorly on SISD, MISD, and MIMD. 

\begin{center}
		\includegraphics[width=.2\textwidth]{CPU-design.jpg} 
		\captionof{figure}{A general-purpose CPU core} 
\end{center}

In contrast, a SIMT architecture often has none of these featuresets, but has many more concurrent ALU threads or 'contexts' on the die.

\begin{center}
	\includegraphics[width=.2\textwidth]{GPU-Design.jpg}
	\captionof{figure}{a single simple GPU core}
\end{center}

Recently, CPU designers have begun building chips with multiple cores.  This gives the CPU the ability to run SISD and MISD operations well, however there are still performance issues with SIMD and MIMD.  GPU design has been taking advantage of multiple cores per chip for years, and since the archtecture takes up less room than a traditional CPU core, a GPU is capable of fitting many cores on the chip \cite{Luebke08}.  This gives the GPU the ability to run SIMD and MIMD operations well.  However, due to the multiple contexts per core, and how the data elements are assigned to these cores, the GPU is not well designed to handle SISD or MISD operations well.

\begin{center}
		\includegraphics[width=.2\textwidth]{Multicore-CPU-design.jpg} 
		\captionof{figure}{A multicore CPU design} 
\end{center}

\begin{center}
	\includegraphics[width=.2\textwidth]{GPU-Design---multiple-cores.jpg}
	\captionof{figure}{A multicore GPU design}
\end{center}

GPU cores have the ability to modify their context storage areas to vary the number of context elements it holds by modifying the size of the context elements \cite{SIG09}, which affects the number of context groups for the core to alternate between during computation.  These alternating context groups is how the GPU tries to hide data stalls.  When a group encounters a branch, since there is no forwarding logic, the group must wait for the data.  While this is happening, the core switches to the next group and performs the same instructions that it just executed on the previous group.  By alternating these groups in this manner, called interleaving, the GPU can hide the vast majority of data stalls or branch calculation waits.

\begin{center}
	\includegraphics[width=.45\textwidth]{GPU-context-interleaving.jpg}
	\captionof{figure}{GPU Context switching}
\end{center}

However, what if the stall time exceeds the amount of time it would take this particular GPU core to alternate amongst all if its context groups?  In this case, there is nothing that can be done to avoid the stall, and the core must wait.  But what if this could be avoided?  Our proposal will endeavor to solve this problem.
\begin{center}
	\includegraphics[width=.45\textwidth]{GPU-context-interleaving-2.jpg}
	\captionof{figure}{GPU context switch failure.}
\end{center}

The collection of all context belonging to one GPU core is called a 'warp', and the number of contexts in a warp is  its 'warp factor' \cite{Luebke08}.  All contexts within a warp will be subjected to the same instruction stream.  As mentioned before, these contexts will be processed context group by context group.  Due to the nature of how data is assigned to these context elements, there is a great deal of spatial locality amongst the contexts in a warp.  Our proposal believes that there is a way to use the information discovered by the first context group being processed by an instruction stream to assist later context groups in their knowledge of stalls.

On a CPU, each core only has one execution context and one branch prediction table.  This shows that each context handles its own branches and receives no external assistance from other threads.  No external assistance can be offered becuase there is no way to guarantee what the other threads are working on and if it could be useful to the current thread.  In contrast, a GPU core has multiple contexts per core that are all performing the same operations.  Is it possible to use this information to assist other contexts in their branch stalls?

\section{Contribution}

\subsection{SIMT Branch Predictor Functional Unit}

Our branch predictor is based on the 2-bit Smith predictor \cite{smith}.  This predictor works with a branch history table, indexed by the PC, where each 
entry in the table stores a 2-bit branch history register.  As seen in the figure below, the branch history register is a saturating counter that represents a state machine.  
The highest order bit of the register determines the prediction to be made, and after a branch is taken or not taken, the result of the branch 
changes the state according to the state diagram shown.

\begin{center}
	\includegraphics[width=.45\textwidth]{bht.png}
	\captionof{figure}{Smith 2-bit predictor.}
\end{center}

Typically, the branch history table is indexed by the low-order bits of the PC.  This is done so that each branch in the program can have a unique register associated with it.  However, on MIMD architectures the length of the program is not limited in any particular way,
so the size of the branch history table is necessarily less then the total number of instructions in the program.  This implies that there may be collisions in the program, if the table is too small to include the whole program, where more than 
one branch maps to the same spot the branch history table.  In contrast, on SIMT architectures there is a hard limit on the maximum instruction length that may be executed by a core, and so
we may guarantee no collisions in our branch history table by taking advantage of that limit.  In our implementation, the size of the branch history table will be exactly the size of the maximum instruction stream length, which is 16,384 instructions.  We chose this size so that there would be absolutely no predictor collisions, and each instruction location will have its own dedicated predictor.  The total memory used up by this design for our predictor design is 2 bits per entry at 16k entries, for a total size of 4k.

\subsection{SIMT Functional Unit Sharing}
\label{intuition}

We propose to implement 3 GPU Branch Prediction schemes with the following features.  In order to communicate information from neighboring 
execution contexts, we share a single branch predictor between the contexts in their context group, and also between all the context groups.  Because
all contexts concurrently modify the predictor, information from one context about the branches that occured may propogate sideways in time to peer threads.

Our simulated hardware designs are as follows.
\begin{itemize}
	\item One predictor is shared amongst the contexts within a GPU core
	\item One predictor is shared only amongst its context group (1 per group)
	\item One predictor is shared only amongst its context (1 per context), as in a multicore CPU, for experimental control
\end{itemize}

\begin{center}
	\includegraphics[width=.2\textwidth]{Our-GPU---per-core-predictor.jpg}
	\captionof{figure}{One shared BHT for the entire core.}
\end{center}

\begin{center}
	\includegraphics[width=.2\textwidth]{Our-GPU---per-context-group-predictor.jpg}
	\captionof{figure}{One shared BHT for each context group.}
\end{center}

\begin{center}
	\includegraphics[width=.2\textwidth]{Our-GPU---per-element-predictor.jpg}
	\captionof{figure}{One BHT for each individual context. This implementation most closely resembles how a general CPU would handle branch prediction.}
\end{center}

The intuition for our approach is as follows.  The SIMT programming paradigm involves the preparation of a single ``kernel'' which is a short
snippet of executable code to be compiled and run in parallel on each of the threads,warps, and possibly cores.  Each instance of this kernel
running on a thread of execution is distinguished from the others by a unique identifying thread id.  Below is an example matrix-vector multiply
kernel with a single branch point.

\begin{lstlisting}[language=C,basicstyle=\footnotesize,frame=single,tabsize=4,title=GemV.cl]
__kernel gemv(double * y, double * A, double * x)
{
    int rowout = cl_thread_id(0);  // index
    y[rowout] = 0.0;
    for(int i = 0; i < A.cols; i++)
        y[rowout] += A[rowout,i] * x[i];
}
\end{lstlisting}

If this kernel was run with a traditional MIMD approach (e.g., one independent branch predictor FU per context of execution), then it is almost
impossible for that branch predictor to correctly predict branch during the final iteration of the inner for loop, as all previous executions of the loop
have been taken, it is rational for the branch predictor to predict not taken.

However, if this kernel was run with one of our SIMT branch predictors, then it IS possible for many of the threads of execution to correctly predict the final iteration
of the for loop as ``not-taken'', because when only a few of its peers mispredict, they update the shared predictor with what happens.  Due to the instruction parallel nature of the 
SIMT paradigm, if a peer mispredicts then it is likely that you will as well, allowing you to pre-emptively fix your prediction.

We will compare prediction accuracies across these three implementations.  Our goal is that after the initial training of the predictors, the GPU cores will process their contexts groups as shown in the figure below:

\begin{center}
	\includegraphics[width=.45\textwidth]{GPU-predict-context.jpg}
	\captionof{figure}{Context interleaving is not necessary, as the groups are properly predicting where to branch and continue operation.}
\end{center}

\section{Experimental Methodology}

\begin{figure*}[t]
	\begin{center}
	\includegraphics[width=.8\textwidth]{data2.png}
	\end{center}
	\caption{Simulation Results}
	\label{results}
\end{figure*}

In order to properly model our proposal, we must have an appropriate simulator modeling SIMD/SIMT architecture.  The SimpleScalar simulator was not sufficient, as it simulates CPU behaviors (including all of the behaviors that are removed from GPUs).  SimpleScalar is not sufficient, so we needed to find a GPU-specific simulator.  We found GPGPU-Sim, which is a rewrite based on the SimpleScalar simulator.  This simulator simulates an NVIDIA Quadro 5800 graphics card, with the following statistics:

\begin{itemize}
	\item 16k register file
	\item 4 GB RAM
	\item 32 GPU cores
	\item 1024 contexts
\end{itemize}

This simulator creates libcuda.so and libOpenCL.so files on the fly to intercept GPU commands from CPU program during execution.  This method of execution requires NVIDIA hardware to build benchmarks because the driver does the compilations.

\begin{center}
	\includegraphics[width=.45\textwidth]{uarch1.jpg}
	\captionof{figure}{GPGPU-Sim architecture.}
\end{center}

\begin{center}
	\includegraphics[width=.45\textwidth]{uarch2.jpg}
	\captionof{figure}{GPGPU-Sim Shader core.}
\end{center}

Since we are simulating GPU hardware, it is important to run our simulator on GPU-specific benchmarks as gcc, gzip, and other common benchmarks are designed to benchmark CPU hardware changes.  We settled on using a subset of ISPASS09 CUDA benchmarks \cite{ispass09}.  We used a subset of these benchmarks, due to problems with the CPU component or problems with the revision of the CUDA architecture supported by our simulator. (double precision vs. single precision computations).  The AES benchmark was unsuitable because none of the instructions were branches.

We settled on a suite of four benchmarks:

\begin{itemize}
	\item \emph{NQU} - Solving the NQUEENS problem
	\item \emph{NN} - Classification of images with a neural net
	\item \emph{RAY} - Raytracing a scene
	\item \emph{BFS} - Breadth-First Search of a graph
\end{itemize}

\section{Simulation Results}

We ran our simulations on all three branch prediction architectures: Shared Branch Predictor per Core, Shared BP per Context Group, Shared BP per Context (MIMD), and got the results shown in Figure \ref{results}.

There were significant improvements in prediction accuracy in the per group and per core predictors when compared to the per context predictor.  The per context predictors on \emph{BFS} scored nearly an 18\% miss rate.  However, both the per group and per core predictors scored amazing 0.9\% and 0.8\% miss rates, respectively.  While not as huge of an improvement as with the \emph{BFS}, the per group and per core predictors posted very respectable improvements over the per context predictor while operating on \emph{NN} and \emph{RAY}.

Second, we noticed that the per group and per core predictors performed worse than the per context predictor while operating on the \emph{NQU} benchmark.  It is possible that this occurred due to the nature of the \emph{NQU} problem, which does not exhibit the same kind of spatio-temporal locality as would be required to take advantage of either the per group or per core predictor information sharing.

\section{Conclusion}
We have demonstrated that the data-parallel nature of the SIMT model can be successfully used to dramatically increase prediction accuracy by sharing prediction structures accross multiple gpus.  We have shown, through experiment, that even
a simple branch predictor such as a Smith bimodal predictor benefits greatly from information given to it by peer contexts sharing dynamic branch information.  We have shown several possible configurations of this branch predictor and demonstrated the performance gains.

\subsection{Future Work}

There are several important areas of future research.

\begin{enumerate}
 \item Multiple predictors can be studied that may perform better than a simple bimodal predictor for this circumstance of peer sharing.
\item The exact impact of stall improvement latency was not studied by our approach, as our simulator did not have the appropriate timing model to model it.  Determining the 
benefit of adding predictors in terms of real CPI performance is crucial to the importance of this class of techniques.
\item Techniques were not studied to possibly share a predictor over multiple concurrent cores.  However, although this is possible, we believe this would not 
be a fruitful area of research due to the fact that multiple cores do not have locality in an program,space,or time sense.
\item The details of physical implementation deserve further study.  A large BHT per core or per context would be a significant amount of die space, and each
element in the table would need to be able to handle multiple concurrent accesses from multiple threads in a single clock, which may prove to be difficult practically.
\end{enumerate}

\bibliographystyle{IEEEtran}
\bibliography{gpu_predict_paper}


% An example of a floating figure using the graphicx package.
% Note that \label must occur AFTER (or within) \caption.
% For figures, \caption should occur after the \includegraphics.
% Note that IEEEtran v1.7 and later has special internal code that
% is designed to preserve the operation of \label within \caption
% even when the captionsoff option is in effect. However, because
% of issues like this, it may be the safest practice to put all your
% \label just after \caption rather than within \caption{}.
%
% Reminder: the "draftcls" or "draftclsnofoot", not "draft", class
% option should be used if it is desired that the figures are to be
% displayed while in draft mode.
%
%\begin{figure}[!t]
%\centering
%\includegraphics[width=2.5in]{myfigure}
% where an .eps filename suffix will be assumed under latex, 
% and a .pdf suffix will be assumed for pdflatex; or what has been declared
% via \DeclareGraphicsExtensions.
%\caption{Simulation Results}
%\label{fig_sim}
%\end{figure}

% Note that IEEE typically puts floats only at the top, even when this
% results in a large percentage of a column being occupied by floats.


% An example of a double column floating figure using two subfigures.
% (The subfig.sty package must be loaded for this to work.)
% The subfigure \label commands are set within each subfloat command, the
% \label for the overall figure must come after \caption.
% \hfil must be used as a separator to get equal spacing.
% The subfigure.sty package works much the same way, except \subfigure is
% used instead of \subfloat.
%
%\begin{figure*}[!t]
%\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%
%\label{fig_first_case}}
%\hfil
%\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
%\label{fig_second_case}}}
%\caption{Simulation results}
%\label{fig_sim}
%\end{figure*}
%
% Note that often IEEE papers with subfigures do not employ subfigure
% captions (using the optional argument to \subfloat), but instead will
% reference/describe all of them (a), (b), etc., within the main caption.


% An example of a floating table. Note that, for IEEE style tables, the 
% \caption command should come BEFORE the table. Table text will default to
% \footnotesize as IEEE normally uses this smaller font for tables.
% The \label must come after \caption as always.
%
%\begin{table}[!t]
%% increase table row spacing, adjust to taste
%\renewcommand{\arraystretch}{1.3}
% if using array.sty, it might be a good idea to tweak the value of
% \extrarowheight as needed to properly center the text within the cells
%\caption{An Example of a Table}
%\label{table_example}
%\centering
%% Some packages, such as MDW tools, offer better commands for making tables
%% than the plain LaTeX2e tabular which is used here.
%\begin{tabular}{|c||c|}
%\hline
%One & Two\\
%\hline
%Three & Four\\
%\hline
%\end{tabular}
%\end{table}


% Note that IEEE does not put floats in the very first column - or typically
% anywhere on the first page for that matter. Also, in-text middle ("here")
% positioning is not used. Most IEEE journals/conferences use top floats
% exclusively. Note that, LaTeX2e, unlike IEEE journals/conferences, places
% footnotes above bottom floats. This can be corrected via the \fnbelowfloat
% command of the stfloats package.


% trigger a \newpage just before the given reference
% number - used to balance the columns on the last page
% adjust value as needed - may need to be readjusted if
% the document is modified later
%\IEEEtriggeratref{8}
% The "triggered" command can be changed if desired:
%\IEEEtriggercmd{\enlargethispage{-5in}}

% references section

% can use a bibliography generated by BibTeX as a .bbl file
% BibTeX documentation can be easily obtained at:
% http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
% The IEEEtran BibTeX style support page is at:
% http://www.michaelshell.org/tex/ieeetran/bibtex/
%\bibliographystyle{IEEEtran}
% argument is your BibTeX string definitions and bibliography database(s)
%\bibliography{IEEEabrv,midterm.bib}
%
% <OR> manually copy in the resultant .bbl file
% set second argument of \begin to the number of references
% (used to reserve space for the reference number labels box)

%\begin{thebibliography}{1}

%\bibitem{IEEEhowto:kopka}
%H.~Kopka and P.~W. Daly, \emph{A Guide to \LaTeX}, 3rd~ed.\hskip 1em plus
%  0.5em minus 0.4em\relax Harlow, England: Addison-Wesley, 1999.

%\end{thebibliography}

%\bibliographystyle{ieeetr}
%\bibliography{midterm}

%\appendix
%\section{Project Progress}
%\subsection{Challenges Encountered}
%\subsection{Completed Tasks}
%\subsection{Remaining Tasks}


% that's all folks
%\bibliographystyle{ieeetr}
%\bibliography{midterm}




%\appendix
%\section{Project Progress}
%\subsection{Challenges Encountered}
%\subsection{Completed Tasks}
%\subsection{Remaining Tasks}

\end{document}


